DOCUMENT RESUME

ED 134 609    95    TM 006 015

AUTHOR        Brumet, Michael E.
TITLE         Applying Bayesian Statistics to Educational
              Evaluation. Theoretical Paper No. 62.
INSTITUTION   Wisconsin Univ., Madison. Research and Development
              Center for Cognitive Learning.
SPONS AGENCY  National Inst. of Education (DHEW), Washington, D.C.
PUB DATE      May 76
CONTRACT      NE-C-00-3-0065
NOTE          32p.
EDRS PRICE    MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS   *Bayesian Statistics; Decision Making; Educational
              Research; *Evaluation; Hypothesis Testing;
              *Probability; Statistics
IDENTIFIERS   Statistical Inference

ABSTRACT 

Bayesian statistical inference is unfamiliar to many
educational evaluators. While the classical model is useful in
educational research, it is not as useful in evaluation because of
the need to identify solutions to practical problems based on a wide
spectrum of information. The reason Bayesian analysis is effective
for decision making is that it defines probability as a measure of
opinion or belief, rather than as long-term frequency. Defining
probability as a measure of opinion or belief enables the Bayesian
investigator to consider a wider range of information than is
possible with the traditional model. Personal expertise, logical
analysis, and soft data from a wide variety of sources serve to shape
opinion about a state of nature, with experimental data providing
additional information either for or against the prior opinion of the
evaluator. In classical statistics, prior knowledge or opinion is
ignored. However, when practical decisions must be made the Bayesian
stresses that all knowledge should be brought to bear on the problem
rather than just an isolated set of data. Because of the
decision-making orientation of the evaluator, the Bayesian model
should be considered as an alternative to classical inference. Since
the Bayesian model views probability as a measure of opinion rather
than as a long-term frequency, the statistical requirements for it
are actually greater than for the classical statistician. Use of a
wider range of distributions than with classical statistics demands
more statistical skills than many evaluators currently possess.
However, the questions raised by the Bayesian model are useful even
if the model is not totally adopted. (Author/RC)


THEORETICAL PAPER NO. 62

Applying Bayesian
Statistics to
Educational
Evaluation






MAY 1976 



WISCONSIN RESEARCH 
AND DEVELOPMENT 
CENTER FOR 
COGNITIVE LEARNING 







Theoretical Paper No. 62

APPLYING BAYESIAN STATISTICS TO
EDUCATIONAL EVALUATION

by

Michael E. Brumet

Report from the Project on
IGE Evaluation

Conrad G. Katzenmeyer
Faculty Associate

Wisconsin Research and Development
Center for Cognitive Learning
The University of Wisconsin
Madison, Wisconsin

May 1976

Published by the Wisconsin Research and Development Center for Cognitive Learning,
supported in part as a research and development center by funds from the National
Institute of Education, Department of Health, Education, and Welfare. The opinions
expressed herein do not necessarily reflect the position or policy of the National
Institute of Education and no official endorsement by that agency should be inferred.

Center Contract No. NE-C-00-3-0065




WISCONSIN RESEARCH AND DEVELOPMENT
CENTER FOR COGNITIVE LEARNING

MISSION

The mission of the Wisconsin Research and Development Center
for Cognitive Learning is to help learners develop as rapidly
and effectively as possible their potential as human beings
and as contributing members of society. The R&D Center is
striving to fulfill this goal by

• conducting research to discover more about
  how children learn

• developing improved instructional strategies,
  processes and materials for school administrators,
  teachers, and children, and

• offering assistance to educators and citizens
  which will help transfer the outcomes of research
  and development into practice

PROGRAM

The activities of the Wisconsin R&D Center are organized
around one unifying theme, Individually Guided Education.

FUNDING

The Wisconsin R&D Center is supported with funds from the
National Institute of Education; the Bureau of Education for
the Handicapped, U.S. Office of Education; and the University
of Wisconsin.



ACKNOWLEDGMENTS 

I wish to express my appreciation to Conrad Katzenmeyer for
providing the opportunity for me to write this paper and for reviewing
each draft. My gratitude is also extended to Steven Jurs and Joseph
Shaffer, who were stimulators for some of the ideas presented in this
paper.







TABLE OF CONTENTS

Acknowledgments

List of Tables

Abstract

I. Introduction

     The Bayesian versus Classical Inference Controversy
     Foundations of Bayesian Statistics

II. Educational Evaluation

     Development of Educational Evaluation
     Evaluation versus Research
     Failure of Classical Statistics in Educational Evaluation

III. Bayesian Analysis

     Definitions of Probability
     Bayes' Theorem
     Formation of Prior Probability Distribution
     An Example of Bayes' Theorem
     Hypothesis Testing
     Decision Making
     Bayesian Analysis for an Evaluation Problem: An Example
     Data Collection Stopping Rule



LIST OF TABLES

Table

Prior, [illegible], and Posterior Probability Distributions
for Math Program Example






ABSTRACT

Because of the current dominance of classical statistics in the
tradition of Fisher, Neyman, and Pearson, an alternative approach known
as Bayesian statistical inference is unfamiliar to many educational
evaluators. While the classical model is useful in educational research,
it is not as useful in evaluation because of the need to identify
solutions to practical problems based on a wide spectrum of information.

Business and marketing researchers have utilized the Bayesian model
for many years because they need to make practical decisions rather than
assertions about some unknown parameter, which is the function of
traditional statistics. The reason Bayesian analysis is effective for
decision making is that it defines probability as a measure of opinion
or belief, rather than as long-term frequency.

Defining probability as a measure of opinion or belief enables the
Bayesian investigator to consider a wider range of information than is
possible with the traditional model. Personal expertise, logical analysis,
and soft data from a wide variety of sources serve to shape opinion about
a state of nature, with experimental data providing additional information
either for or against the prior opinion of the evaluator. In classical
statistics, prior knowledge or opinion is ignored. However, when practical
decisions must be made the Bayesian stresses that all knowledge should be
brought to bear on the problem rather than just an isolated set of data.
Because of the decision-making orientation of the evaluator, the Bayesian
model should be considered as an alternative to classical inference.

Since the Bayesian model views probability as a measure of opinion
rather than as a long-term frequency, the statistical requirements for
it are actually greater than for the classical statistician. Use of a
wider range of distributions than with classical statistics demands more
statistical skills than many evaluators currently possess. However, the
questions raised by the Bayesian model are useful even if the model is
not totally adopted.



I

INTRODUCTION

THE BAYESIAN VERSUS CLASSICAL INFERENCE CONTROVERSY

Throughout the sequence of statistical courses digested by most
prospective educational evaluators, little notice is usually given to
the controversy between so-called Neyman-Pearson and Bayesian
statisticians. This lack of awareness is no doubt due to the dominant
position the Neyman-Pearson (classical) statistician has enjoyed in social
science and educational research. However, in some fields, where practical
decisions must be made on the basis of all available information, the
Bayesian statistical model has proven its usefulness. In business and
marketing research and to a lesser extent in engineering, Bayesian analysis
has been effectively utilized to determine the appropriateness of
alternative decision choices.

The basic philosophical difference between the two approaches concerns
the use of prior information or beliefs. The classical statistician
assumes that only specified data gathered after hypothesis formation can
be used for inference. The Bayesian statistician contends that the data
gathered in an experiment only serve as additional information to be
combined with the investigator's prior information or beliefs. This
combining of data and opinion is done through the use of Bayes' Theorem. To
most educational evaluators the Bayesian approach may seem foreign to all
they have learned about the appropriate use of statistical inference.

The purpose of this paper is to show, first, that the differences
between educational research and educational evaluation result in the
conclusion that statistical techniques appropriate for the former are
not necessarily suitable for the latter; and, second, that the Bayesian
inferential approach offers an alternative statistical model for the
educational evaluator, as he/she is frequently in the position where
classical inferential statistics does not allow for utilization of the
type of information which he/she possesses.



FOUNDATIONS OF BAYESIAN STATISTICS

The classical (Neyman-Pearson) versus Bayesian controversy can be
related to a basic problem in the history of science: the roles of
rationalism and empiricism and the interpretation of probability
statements (Weber, 1973).

To the Greeks, the laws of science were completely precise and
demonstrable through the process of deduction. Fluctuation and
variability were considered error and a reflection of lack of knowledge of
laws. This deterministic view of knowledge was assumed by early
philosophers of science such as Descartes and scientists such as Newton.
However, as science matured there was less and less certainty that a
rational system could be developed from which any data could be seen
as logical consequence. Modern stochastic models such as Maxwell's
thermodynamic laws, Mendel's genetic laws, and Einstein's theory of
relativity saw science developing probability models of phenomena.

Probability models, which have been recognized for a long time, were
used to cope with ignorance about laws of nature or errors of
measurement. However, as stochastic models became more popular, probability
models were seen as characteristic of nature itself rather than simply
reflecting ignorance. Thus, modern philosophers were forced to reconsider
the alternatives that were available to the concept of probability. For
the Bayesian, probability means degree of personal belief about some
phenomenon. This approach contrasts distinctly with that of the classical
statistical school, which considers probability to be long-term relative
frequency. The probability of occurrence of an event has been defined
as the limit of the relative frequency of its occurrence in some
specified reference class of events (Fisher, 1956). With few exceptions,
modern statistical textbooks use this relative frequency interpretation
of probability.

Bayesians find this view of probability too restrictive. Often
statements must be made about nonrepeating events which have a degree
of uncertainty. For example, the statement "The probability is greater
that a man will land on Mars than that a man will land on Jupiter"
makes intuitive sense, yet both events are unique or nonrepeating. By
viewing probability as a measure of belief, the concept takes on broader
and potentially more useful meaning.

Bayesian statistics stems indirectly from a paper by Thomas Bayes
that originally appeared in 1763; however, only in 1961 did the first
systematic use of subjective probability and other elements of the
Bayesian model emerge with the appearance of Schlaifer's Introduction
to Statistics for Business Decisions. Since then, several additional
books with a general business application emphasis have appeared, such
as Savage (1962), Thompson (1972), and Zellner (1971). However, Edwards,
Lindman, and Savage (1963) and McGee (1971) have presented Bayesian
approaches to social sciences, and Meyer (1971) presented a general
paper on Bayesian statistics at the 1971 American Educational Research
Association meeting. This author has not found any additional
publications relating the model to educational situations. In this paper, the
potential for the use of Bayesian analyses in educational evaluation
will be discussed.




II

EDUCATIONAL EVALUATION

DEVELOPMENT OF EDUCATIONAL EVALUATION




Evaluation models developed in the past decade because of
the need to account for and determine the worth of scores of new federally
funded educational programs. The first years after the passage of the
Elementary and Secondary Education Act (ESEA) of 1965 saw evaluative
chaos: every new project was judged by a unique plan. That many of
the resulting evaluations would be inadequate was inevitable. Few
educators possessed the skills needed to conduct effective evaluation.

A few scholars developed generalizable evaluation plans. Such
models were primarily developed by educational researchers, who attempted
to compromise the rigor of traditional experimental design with the
demands of programmatic educational activities. Although the models
did not answer all of the problems of local evaluation, they did provide
general guidelines that eliminated some of the deficiencies of early
ESEA evaluation reports. These models have been primarily
nonquantitative in nature. They consist of general organizational patterns
listing the sequence of evaluation activities and key decision points.
In view of this statistical vacuum, most evaluators continue to use
the traditional statistical inference procedures taught in most
graduate schools of education. Such procedures are usually appropriate
in educational research; however, the differences between research and
evaluation are wide enough that a common statistical methodology should
not be assumed.

EVALUATION VERSUS RESEARCH 

Despite the fact that both evaluation and research can be classified
as disciplined inquiries demanding empirical support, logical analysis,
and openness to public scrutiny, there are some definite distinctions
between these two educational activities. The distinctions are primarily
a matter of degree rather than kind. Additionally, the extent and type
of differences will depend upon the theoretical models under
consideration. However, for the purposes of this paper it will be sufficient to
note that "evaluation is the determination of the worth of a thing"
whereas "research is the activity aimed at obtaining generalizable
knowledge by contriving and testing claims about relationships among
variables or describing generalizable phenomena [Worthen & Sanders,
1973, p. 19]." Other distinctions which may be noted between research
and evaluation are not necessarily inherent in the two activities but
rather reflect the way in which they have been characterized in
practice.

The rules of legitimate evidence for evaluation data are broader
than for research data. Prior to a research investigation, an
experimenter must determine the exact dependent variable which will be
measured. This is not to say that other behaviors will be ignored,
but their consideration must be secondary to the one specified in the
hypothesis. No one would object to the researcher looking elsewhere
for support should the specified behavior not support the hypothesis;
however, this starts an entirely new investigation with a new hypothesis,
new design, and new data. Researchers have always looked with scorn at
post hoc analysis as something less than scientific.

Evaluation activities are more flexible. With emphasis on the
determination of value, the evaluator's duty is to consider evidence
from as many angles as possible. For example, the evaluator may be
requested to measure the effects of a flexible schedule on a number
of independent student projects. If the evaluator notices that such
a schedule seems to result in higher student absenteeism, these data
may be very important in the evaluation even though this is a side effect.

Research activities generally do not provide for feedback loops.
The sequence "problem - hypothesis - sample - data - inference [Hays,
1973, p. 856]" is usually followed until completion. In fact, frequently
the researcher may assign most of the activities to rather unsophisticated
assistants who must simply follow the research plan formed by the
experimenter. After the data are analyzed spinoffs may arise, but the
experiment at hand has ended. Even an overview of the major evaluation
models will demonstrate the extensive feedback system operating in the
models. For example, Stufflebeam (1973) presents a system which is
based on continual looping between decisions, activities, and evaluation.

Parsimony in science requires the avoidance of unnecessary
significant effects. The consequences of asserting that a variable has an
effect when it doesn't (Type I error) are generally considered worse
than saying a variable doesn't have an effect when it does (Type II
error). Traditionally, researchers are willing to make Type I errors
only five or one percent of the time, while the frequency of Type II
errors is rarely considered. The dangers of a Type I error are always
in the forefront of an experimenter's mind. Before a variable is given
a role in a theory, it is carefully examined to see if it is necessary.
Hence a good researcher is characterized as being cautious, conservative,
and skeptical.

However, the evaluator must constantly juggle the relative costs
of a Type I or Type II error. For example, a university may have devoted
years to developing and implementing a competency-based teacher education
(CBTE) program. Suppose that an evaluator were to compare such a program
with a traditional teacher education program. The null hypothesis would
postulate that both types of programs were equally effective. Because
of the added expense of a CBTE program, a finding of no difference might
result in the university's reverting to a traditional program. However,
if the CBTE program were found to be superior, little change in activity
would occur; the program would simply continue. In this situation, a
Type II error would signify that the lengthy development of the CBTE
program was in vain; as a result, the university might decide to switch
back to a traditional program. Of course, in other situations a Type I
error might still cause the most damage. The point is that in evaluation
the consequences of both Type I and Type II errors can vary considerably,
depending on the circumstances and costs. The traditional procedure of
setting a Type I error probability level at .05 or .01 does not involve
consideration of relative costs.
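
One way to make the relative costs explicit is to weight each error probability by its cost. The following is only a minimal sketch, not a procedure taken from this paper: the function name, the assumed probability that the null hypothesis is true, the error rates, and the cost figures are all hypothetical values chosen for illustration.

```python
def expected_error_cost(p_null, alpha, beta, cost_type_1, cost_type_2):
    """Expected cost of the errors a testing procedure can make.

    p_null      -- assumed probability that the null hypothesis is true
    alpha, beta -- probabilities of a Type I and a Type II error
    cost_type_1, cost_type_2 -- cost of each error, in arbitrary units
    """
    type_1_part = p_null * alpha * cost_type_1
    type_2_part = (1 - p_null) * beta * cost_type_2
    return type_1_part + type_2_part


# Hypothetical CBTE-style situation: a Type II error is assumed to be
# ten times as costly as a Type I error.
conservative = expected_error_cost(0.5, alpha=0.05, beta=0.40,
                                   cost_type_1=1, cost_type_2=10)
lenient = expected_error_cost(0.5, alpha=0.20, beta=0.15,
                              cost_type_1=1, cost_type_2=10)
print(conservative, lenient)
```

Under these assumed figures the conventional .05 level carries the higher expected cost, because it buys a low Type I rate at the price of a larger Type II rate.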

FAILURE OF CLASSICAL STATISTICS IN EDUCATIONAL EVALUATION

Classical Neyman-Pearson statistics provides a good model for typical
research. A problem is first recognized, whether from an unexplained
variable in a theoretical model or just some question arising out of daily
experience. From this problem a researcher postulates a hypothesis about
a state of nature. For example, "Does a 15-minute work break in the morning
increase total morning output?" To test this hypothesis, the researcher may
define his population as all office workers in the city. He will then randomly
place 20 such office workers into one of two experimental groups. One group
will receive the coffee break and one will not. The total morning output will
be determined for both groups. Using classical statistics, it would be possible
to test for a difference in the total output of the two groups that is
sufficiently large to be unlikely due to chance.
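
As a rough illustration of such a test, the sketch below uses an exact permutation (randomization) test, swapped in here as a simple stand-in for whichever classical test the researcher would actually choose. The function name and the output figures for the 20 workers are hypothetical; the paper reports no data for this example.

```python
from itertools import combinations


def permutation_p_value(treated, control):
    """One-sided exact permutation test that mean(treated) > mean(control)."""
    pooled = treated + control
    n_treated = len(treated)
    n_control = len(control)
    total = sum(pooled)
    observed = sum(treated) / n_treated - sum(control) / n_control

    as_extreme = 0
    n_splits = 0
    # Enumerate every way the pooled workers could have been split.
    for chosen in combinations(pooled, n_treated):
        sum_chosen = sum(chosen)
        diff = sum_chosen / n_treated - (total - sum_chosen) / n_control
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for rounding
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical morning output for 10 workers in each group.
with_break = [52, 48, 55, 60, 47, 58, 51, 56, 49, 54]
without_break = [46, 50, 44, 53, 45, 49, 47, 52, 43, 48]

p_value = permutation_p_value(with_break, without_break)
print(p_value)
```

The returned value is the proportion of possible group assignments that would give a difference at least as large as the one observed, which is the sense in which a result is "unlikely due to chance." Nothing in the calculation reflects what the researcher expected beforehand.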

This example illustrates the following points. First, the assumption
is made that our researcher has no knowledge of the outcome of the experiment.
In reality the experimenter obviously expects certain results or he/she
would probably not conduct the experiment. In fact, it is because he/she
has some knowledge of expected results that the ironical situation arises
of the experimenter's assuming that he/she has no expectancy. Hopefully
the research will be public and will not be subject to the demand
characteristics of the experimenter. The experiment is to be a microcosm where the
effects of variables will be assessed strictly in terms of what occurs in
the experiment itself. All outcomes will be based on the likelihood of a
sample result, not on any activity which is determined by the prejudices of
the experimenter. If the differences between the two groups are
"statistically significant," we then conclude that the variable does have an effect
and it can be included in our theoretical model. However, if the results
are not statistically significant, it cannot be concluded that the variable
does not have an effect. All we can say is that there is no evidence to
include the variable in our model. This technique is appropriate for the
slow process of theory building. However, an evaluator frequently does
not have the chance for noncommitment that is given to the researcher.
Given that the evaluator is forced to make a decision, failure to reject
the null hypothesis will probably lead to the conclusion that the programs
were equivocal. However, in doing this the rules of inference are being
compromised. Classical procedures were not designed for the purposes of
effective decision making, but rather for making assertions about
population parameters (Edwards et al., 1963).

Secondly, when an evaluator is hired the employer wants to utilize
all of the expertise available at that time. The classical approach does
not allow for this since the researcher must be considered ignorant of
expected outcomes; only the data can provide information. In reality the
evaluator or other staff members should have considerable knowledge about
the situation.

Finally, traditional statistics does not lend itself to the feedback
system of evaluation. Inferences are strictly limited to the sample data
under current consideration. Any additional tests will be based on new
sample data. Frequently in evaluation, if conclusions cannot be clearly
made on available evidence, more data, perhaps of a slightly different
nature, will be examined. With the classical approach, the first set of
data would be examined and tested independent of the second set. Thus, if
proceeding through a feedback loop in an evaluation model, the test of the
new data must be conducted and analyzed independent of the first set.

The conclusion to be drawn is that the classical statistical procedures
utilized in research are ineffective in evaluation for the following reasons:
(1) The classical approach does not allow for maximum utilization of the
expertise available from the evaluator and/or project staff; (2) It can only
reject the null hypothesis, never accept it; (3) The classical approach
cannot effectively consider relative costs of Type I or Type II errors;
(4) It cannot handle feedback loops, which increase the relative store of
information.






III

BAYESIAN ANALYSIS



It should be emphasized that all of the mathematical concepts used in
traditional statistics are available to Bayesian statisticians. However,
some of the interpretations of the classical concepts are broader among
Bayesians. Perhaps the most important of these concerns the issue of
probability.



DEFINITIONS OF PROBABILITY

At least three separate "types" of probability can be distinguished:

1. A priori probability. This type of probability follows the branch
of pure mathematics called the calculus of chances. It is not based on any
data that have been gathered, but rather on logical assumptions that are
made about the nature of events. It often relies on physical
characteristics, such as symmetry. The toss of a coin or die is a popular example
of symmetry.

2. Statistical probability. Probabilities of this nature are estimated
by observing the ratio of the frequency of occurrence of some event to the
total number of opportunities that are available for the event to occur. It
is the relative frequency of a given kind of event or phenomenon within a
class of phenomena, usually called a "population." For example, to say
that the probability of a particular child being born male is .52 means that
of all known births, .52 of them have been males. This is a case where
statistical probability may not correspond to a priori probability, since
the a priori probability of a male being born is .50 according to the
principle of indifference. Of course, there may be some explanation for
this discrepancy which will change the a priori probability.

3. Subjective probability. The third type of probability is distinctly
Bayesian and is perhaps the cornerstone of the differences between Bayesian
and classical statistics. It is important to note that the Bayesian notion
of subjective probability cannot be properly criticized on mathematical
grounds, since it employs the same mathematical machinery as other concepts
of probability.

If Bayesian methods are to be criticized, the criticism should 
be based on their intuitive reasonableness and appeal, the reality 
of their assumptions about human behavior, their pragmatic value, 
their place within empirical science and so on. They are not 
properly criticized on mathematical grounds. . . . There is no 
question of the formal validity of the probability statements 
such methods yield, so long as we play within the formal rules
of the game [Hays, 1973, p. 810].




The notion of subjective probability can be defined as follows.
A probability value is a measure of strength of an individual's opinion
or belief about the existence of some situation or the occurrence of
some event. This is distinctly different from a relative frequency
probability; Edwards, Lindman, and Savage (1963) indicated that with
personal probability you are saying something about yourself as well
as the event you are trying to predict. For example, if I were told that
the probability of getting a head in a coin toss is .5, I would
understand that according to traditional ideas of probability, with a large
number of flips, half of the results would be heads. However, if I were
told that the coin to be used had either 2 heads or 2 tails, the
traditional meaning of probability wouldn't operate, yet I would probably
place bets in line with my former prediction of .5.
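
The bet at .5 can be backed by a short calculation. As a sketch, assume (the text does not say so explicitly) that the two-headed and the two-tailed coin are judged equally likely, write HH and TT for those two possibilities, and read P(head/HH) as the probability of a head given that the coin is two-headed:

$$P(\text{head}) = P(\text{head}/HH)\,P(HH) + P(\text{head}/TT)\,P(TT) = (1)(.5) + (0)(.5) = .5$$

No long-run frequency is involved, since the coin will show the same face on every toss; the .5 describes only the bettor's uncertainty about which coin is in use.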

BAYES' THEOREM

Bayes' Theorem is essentially an algebraic relationship by which
prior probabilities are revised in view of additional data to obtain
posterior probabilities. This relationship is most useful in situations
involving subjective prior probability distributions.

Let P(H) equal the probability of a hypothesis being true prior to
data collection. This is defined as the prior probability. It is a
subjective probability although it does not exclude the possibility of
elements of a traditional frequency model. The statement "The
probability that a person walking in my office is a college senior is .20," could
be considered a subjective probability reflecting opinion prior to data
collection. This statement may be based on a frequency concept of
probability, if perhaps 20 percent of all persons on campus are seniors; or
it may be more subjective, being based on the number of seniors that I
know, past experience with people walking in my office, or even a
nonspecific "gut" feeling.

According to McGee (1971), this kind of probability can also be
expressed in terms of odds. The odds on a statement are determined by
the probability of a statement divided by the probability of its denial.
If you offer three to one odds in a bet on a football game, this is the
same as saying you feel that your chance of winning the bet is .75. The
odds you offer are usually based on both objective data, such as the
frequency of your team winning, and subjective feelings about the physical
condition of both teams, where the game is played, and many other factors.
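
The conversion between odds and probability described above can be sketched in a few lines. The function names are hypothetical and serve only this illustration.

```python
from fractions import Fraction


def odds_to_probability(in_favor, against):
    """Convert odds of in_favor:against into a probability."""
    return Fraction(in_favor, in_favor + against)


def probability_to_odds(p):
    """Odds on a statement: its probability divided by that of its denial."""
    p = Fraction(p)
    return p / (1 - p)


three_to_one = odds_to_probability(3, 1)
print(float(three_to_one))                 # 0.75
print(probability_to_odds(three_to_one))   # 3
```

Exact fractions are used so that the round trip from odds to probability and back loses nothing to rounding.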

Let P(D/H) equal the probability of obtaining the data given that the
hypothesis is true. This is usually the most public probability since
many models are available which are widely accepted by researchers. This
probability provides the likelihood function. Traditional studies are
heavily based upon this type of conditional probability as reflected in
the statement "What is the probability of obtaining a sample mean as
great as X, given that the hypothetical population mean is μ?"

Let

$$P(D/H) = \frac{P(D \cap H)}{P(H)}$$

The probability P(D), or "the probability of the data," is of little
intuitive interest and primarily serves a standardizing role. It is
defined thus:

$$P(D) = \sum_i P(D/H_i)\,P(H_i)$$

summed over each alternative hypothesis $H_i$.

A little algebra now leads to the basic form of Bayes' Theorem:

$$P(H/D) = \frac{P(D/H)\,P(H)}{P(D)}$$

P(H/D) is the probability that the hypothesis is true on the basis
of both the initial probability P(H) and the experimental data.
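
A small numerical sketch may help fix the roles of the three quantities. The two hypotheses, their prior probabilities, and the likelihoods below are hypothetical numbers chosen only for illustration, and the function name is likewise an assumption of this sketch.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' Theorem to a set of competing hypotheses.

    priors[h]      -- P(H) for each hypothesis h (should sum to 1)
    likelihoods[h] -- P(D/H), the probability of the observed data under h
    """
    # P(D): the sum of P(D/H) * P(H) over every hypothesis
    p_data = sum(likelihoods[h] * priors[h] for h in priors)
    # P(H/D) = P(D/H) * P(H) / P(D)
    return {h: likelihoods[h] * priors[h] / p_data for h in priors}


# Hypothetical evaluator's opinion before and evidence after data collection.
priors = {"program effective": 0.6, "program not effective": 0.4}
likelihoods = {"program effective": 0.8, "program not effective": 0.3}

post = posterior(priors, likelihoods)
print(post)
```

With these assumed inputs P(D) is .60, and the evaluator's belief that the program is effective rises from .6 to about .8. The division by P(D) does nothing more than make the posterior probabilities sum to one, which is the standardizing role described above.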

Thus, in using Bayes' Theorem, a prior probability P(H) is formed
based on any information available, including logic or intuition. Then
data are collected from a sample and a likelihood function [P(D/H)] is
formed. This information is then used to refine the prior probability,
which results in a posterior probability. Bayes' Theorem can be
reformulated to apply to continuous parameters. If a parameter η has a
prior probability density function u(η), and if x is a random variable for
which v(x/η) is the density of x given η, and v(x) is the density of x,
then the posterior probability density of η given x is

$$u(\eta/x) = \frac{v(x/\eta)\,u(\eta)}{v(x)}$$
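For readers who prefer to see the continuous form computed, the following is a minimal sketch that approximates the posterior density on a grid of parameter values. Everything numerical in it is a hypothetical illustration rather than material from the paper: a normal prior for a mean I.Q. (mean 100, standard deviation 15), an observed sample mean of 92, and a standard error of 3.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def grid_posterior(grid, prior_density, likelihood):
    """Approximate u(eta/x) = v(x/eta) u(eta) / v(x) on an evenly spaced grid."""
    step = grid[1] - grid[0]
    unnormalized = [prior_density(eta) * likelihood(eta) for eta in grid]
    v_x = sum(unnormalized) * step  # Riemann-sum approximation of v(x)
    return [u / v_x for u in unnormalized], step


# Hypothetical values: prior Normal(100, 15); sample mean 92, standard error 3.
grid = [40 + 0.1 * i for i in range(1201)]  # candidate means from 40 to 160
posterior, step = grid_posterior(
    grid,
    prior_density=lambda eta: normal_pdf(eta, 100.0, 15.0),
    likelihood=lambda eta: normal_pdf(92.0, eta, 3.0),
)
posterior_mean = sum(eta * p for eta, p in zip(grid, posterior)) * step
print(round(posterior_mean, 2))  # about 92.31, between the data and the prior mean
```

With a normal prior and a normal likelihood the answer can also be obtained in closed form as a precision-weighted average, which is a convenient check on the grid result.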

FORMATION OF PRIOR PROBABILITY DISTRIBUTION [P(H)]

In classical statistical testing, the only source of relevant
information about a population is the results gained from a sample.
Of course certain assumptions are held about the basic population
distribution, but nowhere does the prior opinion of the evaluator enter
explicitly as it does in Bayesian analysis. Each sample is drawn as
though it were the first of its sort ever taken.

If an important decision is to be made, logic would dictate the
use of all relevant information. In reality, this is just what most
evaluators do. While they collect their "hard" data, their intuition will
predominate should the data turn out to be nonsupportive. Through
rationalization, internal analysis of the data, or secondary findings,
the evaluator will somehow let her/his subjective opinion interact with the
statistical results. The Bayesian researcher attempts to let his/her
intuition enter the decision making process through the prior probability
distribution rather than through the back door.

By definition there are no explicit rules for determining the shape
of the prior probability distribution. However, as noted by Edwards
et al. (1963), there are times when the shape of the initial distribution
will be very important in the final outcome:

1. If small prior probabilities are initially assigned to areas
where the data indicate the parameter is located.

2. If a large probability is assigned to a region where the data
are nonsupportive.

3. If both the experimenter's prior opinion and the data are diffuse.

4. If observations are expensive and relatively few can be made.

5. If major decisions, such as sample size, have to be made prior
to the collection of much data.



AN EXAMPLE OF BAYES' THEOREM 

Consider a situation where an evaluation team needs an estimate of
the Wechsler I.Q. for a school district when no test results are readily
available. The evaluation team should utilize teacher comments,
observational data, achievement test results, and any other information
that is available. The team may then meet together and collectively decide
the shape of the probability distribution.

In the absence of any information, the best estimate of the mean
I.Q. may form a normal distribution with a mean of 100 and a standard
deviation around 15. That is, if they were to know nothing about the
school district, they would probably assume that it is average. An
average school district having a mean I.Q. above 130 is not very likely,
in their opinion; the probability is perhaps no more than two or three
percent.

However, one of the staff members had reading scores that indicated
these students were a year behind in reading. Also, several teachers
remarked that the students seemed rather "slow." Because of this additional
information, the evaluation team may decide that a positively skewed
distribution with a mean of 95 best describes the group opinion. Although
some information implies these students may be below normal, such
information is not totally reliable and in fact the students might be
normal. Therefore, even though the group opinion is that the mean I.Q.
score is most likely around 95, they may feel the chances are greater that
it is above 95 than below 95. A positively skewed distribution would
reflect this feeling.

Assume for the moment that the decision has been made to actually test
a sample of children with the Wechsler Scale. This information would then
be incorporated into the prior probability distribution to form the
posterior opinion. The technique used to accomplish this would be Bayes'
Theorem.

An example from Hays (1973) will help to solidify what has been
presented. Suppose this time the evaluation team is concerned about the
academic ability of a single child. There is general agreement among
psychologists that I.Q. is best represented as forming a normal
distribution with a mean of 100. Consider the following:



I.Q.              PROBABILITY (H)

130 - greater          .02
115 - 129              .14
100 - 114              .34
 85 - 99               .34
 70 - 84               .14
less - 69              .02

This is the prior probability distribution P(H) for this example. It
is a subjective probability distribution but there is widespread agreement
about its shape. In this situation, it may be the distribution for the
Wechsler.








Suppose that a two-item test is given to a large group of children and
the distribution of test scores given each level of I.Q. is obtained. This
would be a conditional probability distribution showing the distribution of
scores on the two-item test for each level of I.Q. This would provide
P(D/H), that is, "What is the probability of an experimental outcome D,
given the hypothesis H?" To determine this distribution, scores would be
needed for both I.Q. and for the two-item test.



                  PROBABILITY (D/H)

I.Q.                 TEST SCORE

130 - greater
115 - 129
100 - 114
 85 - 99
 70 - 84
less - 69




Each column represents a conditional distribution of scores given
intelligence level. Suppose a child is brought in and the probability that
his I.Q. is over 130 must be quickly determined. Without any information
about the child, the probability would be .02, based on our prior
information about the distribution of intelligence. Now suppose that the
child is given the brief test. If the child scored 2, the probability of an
I.Q. over 130 is .80 on the basis of this test information alone. However, a
Bayesian would be uncomfortable with a probability this high, since from
previous experience very few people have an I.Q. over 130. The score of
a single test should be considered, but it should be tempered with prior
feelings about the rarity of the event. This is done through the use of
Bayes' Theorem, as presented earlier:



$$P(130/2) = \frac{(.80)(.02)}{(.80)(.02) + (.50)(.14) + \cdots + (.05)(.02)} = \frac{.016}{.346} = .046$$



Using the same procedure, other values would result in the following
posterior distribution for those children scoring 2 on the test.



I.Q.              PROBABILITY (H/D)

130 - greater          .046
115 - 129              .202
100 - 114              .393
 85 - 99               .295
 70 - 84               .061
less - 69              .003
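The posterior values above can be reproduced with a short calculation. This is a sketch under stated assumptions: the prior probabilities of .02 and .14 and the likelihoods of .80, .50, and .05 for a score of 2 come from the worked formula, while the two middle priors of .34 (from the normal curve) and the middle likelihoods of .40, .30, and .15 are not legible in the source and were back-solved here so that the total equals the reported .346.

```python
IQ_LEVELS = ["130 - greater", "115 - 129", "100 - 114",
             "85 - 99", "70 - 84", "less - 69"]

# Prior P(H). The .34 entries are assumed from the normal curve.
PRIOR = [0.02, 0.14, 0.34, 0.34, 0.14, 0.02]

# Likelihood P(score of 2 / H). The middle three values are assumptions
# back-solved from the reported posterior, not figures from the paper.
LIKELIHOOD_SCORE_2 = [0.80, 0.50, 0.40, 0.30, 0.15, 0.05]


def bayes_posterior(prior, likelihood):
    """Return the posterior P(H/D) and the standardizing term P(D)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    p_data = sum(joint)  # P(D) = sum over i of P(D/H_i) P(H_i)
    return [j / p_data for j in joint], p_data


posterior, p_data = bayes_posterior(PRIOR, LIKELIHOOD_SCORE_2)
print(f"P(D) = {p_data:.3f}")
for level, p in zip(IQ_LEVELS, posterior):
    print(f"{level:<14} {p:.3f}")
```

Under these assumptions the printed column agrees with the posterior distribution reported in the text to three decimal places.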



By comparing the posterior probability distribution P(H/D) with the
prior probability distribution P(H), one can see the effect of the
additional information. I.Q. values greater than 100 have a higher
probability than with the prior distribution. The probability of an I.Q.
score greater than 130 has increased from .02 to .046. However, the change
has not been nearly so great as if the data alone were considered. Thus,
for a Bayesian, if an event is seen as rare, one set of data will usually
not drastically change the probability of that event's occurrence. With
the Bayesian model, this posterior probability distribution can be
considered a prior probability for another round of data collection. This
process can continue, with each set of data being integrated with the
prior distribution. Unlike the classical approach, no set of data can
stand alone.
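The idea that a posterior becomes the prior for the next round can be sketched as follows. The second round of data is hypothetical: it supposes the same child scores 2 on a second, independent administration of the brief test. The middle prior and likelihood values are assumptions chosen to be consistent with the reported total of .346, as they are not legible in the source.

```python
PRIOR = [0.02, 0.14, 0.34, 0.34, 0.14, 0.02]               # P(H), six I.Q. levels
LIKELIHOOD_SCORE_2 = [0.80, 0.50, 0.40, 0.30, 0.15, 0.05]  # P(score 2 / H), partly assumed


def update(prior, likelihood):
    """One application of Bayes' Theorem over a discrete set of hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


# Round 1: the prior is revised by the first test score.
after_first = update(PRIOR, LIKELIHOOD_SCORE_2)

# Round 2 (hypothetical): the first posterior serves as the new prior.
after_second = update(after_first, LIKELIHOOD_SCORE_2)

# Processing both scores in one step gives the same answer, provided the
# two scores are independent given the I.Q. level.
both_at_once = update(PRIOR, [l * l for l in LIKELIHOOD_SCORE_2])

print(round(after_first[0], 3))   # probability of I.Q. over 130 after one score
print(round(after_second[0], 3))  # after two scores
```

The probability of an I.Q. over 130 rises again with the second score, but it remains well below what the test data alone would suggest.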



HYPOTHESIS TESTING

In addition to the notion of subjective probability, the Bayesian
position on hypothesis testing would also elicit great negative response
from classical statisticians. The Bayesian position is best reflected in
a statement by D. L. Meyer:

     Our teaching must be revolutionized to the point where
     topics such as confidence intervals and tests of significance
     are taught almost as an afterthought for those fortunate enough
     to have a formal model. In discussing the first course in
     statistics with the staff of a certain University, they told me
     that descriptive statistics is taught in only two weeks so that
     "we get to the important stuff -- inference -- quickly." I am
     proud to report that at Syracuse, we spend a full semester on
     the important stuff -- descriptive statistics (1975, p. 4).

So rarely do Bayesians have reason to use inferential tests that a
discussion of Bayesian statistics could be relatively complete without
bringing up the issue. However, because it is so important in classical
statistical thought it cannot be avoided entirely.

The most popular notion of a test is a tentative decision between
two hypotheses on the basis of data. This usually leads to a choice
between two actions, such as whether or not to include a supplementary
reading program in the curriculum. With this purpose in mind, the
usefulness of the null hypothesis is questionable, since the null
hypothesis can be expected to be false from the beginning (Edwards et al.,
1963). The differential treatment of two groups, regardless of the nature
of the treatment, is usually sufficient to result in some difference in
scores on the dependent variable (Edwards et al., 1963). This is probably
magnified in educational environments, where placebo effects are
frequently significant.

Of course more is at stake than simply choosing between two options,
since economic matters will enter in determining what activity is actually
followed. One would hardly choose to implement an expensive reading program
unless there were clear evidence of the likelihood of substantial
improvement. The whole issue of statistical versus educational significance
makes clear that the hypothesis-testing procedure is taken only half
seriously.

This problem is avoided by the Bayesians since hypothesis testing
does not hold any particular relevance. That particular probabilities such
as .01 or .05 are highlighted for special treatment is without special
interest to the Bayesian. In deciding between two alternatives, simply
reviewing their posterior probabilities will usually reveal the appropriate
decision. Any further analysis is likely to be based upon utility values
(in the language of decision theory) rather than on a particular
probability value.

For example, if two math programs are being compared and both involve
the same economic factors, then the program that seems to be better should
be adopted, however slight the superiority. If one program is more expensive
than the other, then a payoff function may be applied to account for cost
differences. If one program is markedly more expensive than the other, then
it will only be adopted if there is clear evidence of its superiority. A
traditional significance level of .01 may not be enough to justify the
relative difference in cost. Finally, even if the relative costs are the
same, if one program appears to have a stronger theoretical foundation,
then we would require rather convincing evidence to justify adoption of the
alternative program. Bayesian analysis, through the use of the prior
probability, can account for this prejudice; traditional inference cannot.
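A payoff function of the kind mentioned above might be applied as in the following sketch. All of the numbers are hypothetical and chosen only for illustration: a posterior probability of .70 that the more expensive Program B is the better one, a benefit of 10 units if it is, a loss of 4 units if it is not, and an added cost of 6 units for adopting it.

```python
def expected_payoff(posterior, payoffs):
    """Weight each payoff by the posterior probability of its state of nature."""
    return sum(posterior[state] * payoffs[state] for state in posterior)


# Hypothetical posterior opinion about which program is really better.
posterior = {"B_better": 0.70, "A_better": 0.30}

EXTRA_COST_OF_B = 6  # hypothetical added cost of the expensive program

# Payoffs for each action under each state, net of cost (hypothetical units).
payoffs = {
    "adopt B": {"B_better": 10 - EXTRA_COST_OF_B, "A_better": -4 - EXTRA_COST_OF_B},
    "keep A": {"B_better": 0, "A_better": 0},
}

expected = {action: expected_payoff(posterior, table)
            for action, table in payoffs.items()}
best_action = max(expected, key=expected.get)

print(expected)
print(best_action)
```

In this illustration Program B is probably the better program, yet its added cost makes keeping Program A the preferred action, which is the situation the paragraph describes.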

As noted earlier, although Bayesian statistical inference does not
require the testing of hypotheses in the traditional sense, testing can
occur since the probability laws that apply in classical statistics also
apply to Bayesian analyses; only the meaning is different. Therefore, the
same testing techniques available to classical inference can be
legitimately used by Bayesians. This difference in meaning, however, can
lead to situations where a classical statistician may reject a null
hypothesis while a Bayesian will see the same data as providing support
for the null hypothesis. Edwards et al. (1963) developed this argument in
detail; it will not be reviewed in its entirety here. However, the
intuitive argument can be made more quickly.

If a true null hypothesis is being tested under a one-tailed t-test
with a large sample, a t-value between 1.68 and 1.96 will occur two percent
of the time. Of course, if the null hypothesis is false, the frequency of
this outcome will depend on the alternative hypothesis. However, given the
condition of no prior knowledge as specified with classical inference, a
uniform distribution of alternative hypotheses with t-values between 0 and
20 is not unreasonable. Under such a condition, and assuming the null
hypothesis is false, t will fall in the range from 1.68 to 1.96 at most
1.40% of the time (i.e., (1.96 - 1.68)/20 = 1.40%). Thus a t-value would
occur in this interval more frequently under the condition of a true null
hypothesis than under a diffuse alternative hypothesis, despite the fact
that the classical statistician would reject the null hypothesis.
Obviously, any answer to this argument depends upon the shape of the prior
distribution of the alternative hypothesis. The classical statistician
cannot handle this problem since he does not consider the shape of such a
distribution.
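The two percentages in this argument can be checked directly. The sketch below assumes, in keeping with the "large sample" condition, that the t distribution is well approximated by the standard normal; the uniform spread of t-values from 0 to 20 under the alternative is the paper's own illustrative choice.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


LOW, HIGH = 1.68, 1.96

# Chance of landing in the interval when the null hypothesis is true.
p_interval_given_null = normal_cdf(HIGH) - normal_cdf(LOW)

# Chance of landing in the interval under a uniform spread of t-values
# from 0 to 20, the diffuse alternative described in the text.
p_interval_given_alternative = (HIGH - LOW) / 20.0

# A ratio above 1 means the observation favors the null hypothesis.
ratio = p_interval_given_null / p_interval_given_alternative

print(round(p_interval_given_null, 4))         # roughly two percent
print(round(p_interval_given_alternative, 4))  # 0.014
print(round(ratio, 2))
```

The ratio of the two probabilities is a likelihood ratio; because it exceeds 1, a t-value in this interval lends more support to the null hypothesis than to the diffuse alternative, even though the one-tailed test would reject at the .05 level.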

To continue with the current discussion would lead to theoretical
issues beyond the scope of this paper. However, the point must be made
that although Bayesian statistical analyses do allow for the use of
classical testing procedures, sometimes the outcomes will not coincide.
Since Bayesians usually hold that traditional testing procedures are
simply a nominal statistical exercise, the question of how Bayesians make
decisions remains.






DECISION MAKING

Upon reflection, the classical approach of rejecting a sharp null
hypothesis doesn't really provide much information. As anyone familiar
with information theory is aware, the most knowledge is gained by reducing
the number of options by half. Given the stochastic nature of the world,
one can obviously not eliminate 50 percent of all alternatives with
certainty. However, one should be able to do better than simply reach a
probabilistic conclusion that the real value of some parameter does not
appear to be at one point along an infinite range of possible values.

This no doubt was the feeling of McGee (1971) when he noted:

     The author had for some years felt the need to consider
     alternatives to setting up a null hypothesis in order to
     reject it and, having considered the position of the Bayesians,
     was a ready convert at an intensive summer session in information
     theory. . . . The idea of an experimenter approaching a
     statistician with the request "I want to reject this null
     hypothesis" was totally out of place [p. 277].



An example of the Bayesian approach to estimating the probability that an
I.Q. value was greater than 130 was presented earlier. Now consider a
situation where a decision between two alternative math programs must be
made. The traditional approach would be to set up a null hypothesis of
no difference on some test score. The alternative hypothesis would be
that there is a difference between the two programs. A two-tailed test is
likely to be employed. This is a typical kind of problem found in
educational evaluation.

An obvious difficulty can arise in this kind of situation. Suppose a
nonsignificant t-value were obtained. There would then be no particular
reason for favoring one program over the other and a toss of a coin would
be an appropriate decision maker. Obviously, the evaluator is given license
to draw any conclusion. She/he could simply choose the program which
resulted in the slightly higher mean test score and use this score as
partial justification. This is often done when one hears the report that
"means were in the right direction," or "results were just short of
significance." According to the classical rules, these statements carry as
much weight as saying the coin almost landed tails. Clearly, with
nonsignificant results, the evaluator cannot reach any other conclusion
than to assume no difference for the moment. Suppose she/he plays the game
and doesn't draw conclusions on the basis of the data. She/he is then
likely to appeal to other items, such as cost effectiveness, theoretical
soundness, or teacher preference.

When the evaluator begins to look at this kind of evidence, she/he is
stepping outside of the public system of inference and becomes open to
criticism. While heterogeneous and frequently qualitative data are useful,
an evaluator trained in classical statistics has difficulty in dealing with
such information in a systematic way. Because she/he has difficulty
explaining the procedure she/he used to make decisions, the evaluator is
frequently criticized for being biased or subjective. By making public
the process whereby she/he used these other sources of information, the
Bayesian evaluator reduces the chances that she/he will be criticized for
personal bias, fuzzy thinking, or irrational decisions. Others may
disagree with her or his prior probability distribution, but the way
that the research data and subjective opinion were brought together
cannot be disputed, assuming acceptance of the Bayesian paradigm.

Bayesian Analysis for an Evaluation Problem: An Example

Math Program Y has recently been developed to replace the standard
Math Program X. School District No. 32 wants to consider adoption of
Math Program Y because it appeals to a community which considers itself
progressive and innovative. Not much money is available for data
collection, so the evaluation will have to utilize a small sample. The
evaluators, who have considerable knowledge of math education programs,
are not enthusiastic about Math Program Y. They feel the program has
serious flaws and has not been adequately developed.

The evaluators, along with school district personnel, have agreed
to use the scores on the Standard Math Achievement Test to help with the
judgment. A lot of data have been collected in the past from schools
using Program X. However, the data are old, since the program has not
been heavily used in recent years. The national norms for schools using
Program X are available on the Standard Math Achievement Test and are
as follows:



Score      Distribution

 15            .00
 14            .10
 13            .10
 12            .10
 11            .25
 10            .25
  9            .10
  8            .10
  7            .00
  6            .00
  5            .00
  4            .00
  3            .00
  2            .00
  1            .00



In addition to the old data, the following information is also
available:

1. School District No. 32 is located in an upper middle class area
where the parents of the students are engaged in skilled or professional
occupations.

2. Math Program X has been recently modified and, according to some
sources, probably would result in better test performance.

On the basis of all this information, the task is to determine where
the value of μ, the mean score on the Standard Math Achievement Test,
lies for School District No. 32 using Program X. The prior probability
distribution for Program X is presented in Table 1 along with the rest of
the data needed for this example.




                              TABLE 1

        PRIOR, DATA, AND POSTERIOR PROBABILITY DISTRIBUTIONS
                     FOR MATH PROGRAM EXAMPLE

Mean          Prior              Data            Posterior
Test       Distributions     Distributions     Distributions
Score       X       Y         X       Y         X       Y

 15        .20
 14        .25
 13        .25                       .05
 12        .15     .10               .90               .95
 11        .10     .10       .05     .05      .263     .05
 10        .01     .20       .90              .474
  9        .01     .20       .05              .263
  8
  7        .01     .10
  6        .01     .10
  5                .10
  4
  3
  2
  1

The prior probability distribution for Program Y is almost totally
subjective since little data are available. The lack of enthusiasm for the
program by the evaluation team is revealed by the prior probability
distribution.

Test data are obtained for both Program X and Program Y. The mean score
for Program X is 10 and for Program Y is 12. However, since there may be
measurement error in the test data, some credibility is given to the scores
next to the obtained mean (.05 for each score). The posterior distribution
has been determined by Bayes' Theorem. The results are somewhat similar to
those obtained with traditional statistics, which would rely on the data
alone. However, confidence in the true value of the mean for Program X is
weak. The evaluators are only 47 percent certain that the value is 10,
while they are quite certain (95 percent) about the true value for Program
Y. This is because the data tend to go considerably counter to their
prior feelings about the true value for Program X. Perhaps more data for
Program X would be appropriate.
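The Program Y column of Table 1 can be worked through as a sketch of how the posterior entries arise. The data distribution (.90 at the obtained mean of 12 and .05 at each neighboring score) comes from the text. The prior values used here are a reading of a poorly reproduced table and should be treated as assumptions: .10 at a mean of 12, .10 at 11, and no prior weight at 13.

```python
def posterior_from_table(prior, data):
    """Apply Bayes' Theorem score by score; scores missing from a column count as 0."""
    scores = sorted(set(prior) | set(data), reverse=True)
    joint = {s: prior.get(s, 0.0) * data.get(s, 0.0) for s in scores}
    total = sum(joint.values())
    return {s: joint[s] / total for s in scores if joint[s] > 0.0}


# Program Y prior over the district mean (assumed reading of Table 1).
prior_y = {12: 0.10, 11: 0.10, 10: 0.20, 9: 0.20,
           8: 0.10, 7: 0.10, 6: 0.10, 5: 0.10}

# Data distribution for Program Y: obtained mean of 12, with .05 credibility
# given to each adjacent score to allow for measurement error.
data_y = {13: 0.05, 12: 0.90, 11: 0.05}

posterior_y = posterior_from_table(prior_y, data_y)
for score, p in posterior_y.items():
    print(score, round(p, 2))
```

On this reading the posterior probability of .95 at a mean of 12 and .05 at 11 matches the Program Y entries in Table 1. The Program X column can be examined with the same function once its prior values are settled.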




These findings illustrate the generalization that when the prior
probability distribution and the data distribution are dissimilar, the
posterior probability distribution will be diffuse. This is logical since
data that go against common sense will usually result in greater
uncertainty about the real nature of events than data that support prior
beliefs. In classical statistics, conclusions should be unaffected by
prior opinion.

DATA COLLECTION STOPPING RULES

A final distinction between the two inference techniques lies in
their different treatment of data stopping rules. Classical procedures
require the experimenter to specify in advance how much data she/he will
collect. This is contained in most texts on general statistics
(e.g., Hays, 1963). Such a requirement is part of an overall rule by
classical statisticians to specify all data collection and data analysis
activities prior to any actual data gathering, with the exception of
certain post hoc tests. This principle is extremely difficult for an
educational evaluator to follow, since his working environment is
very fluid; initial specification of all data-gathering procedures is
usually impossible. The classical statistician is so specific on this
point because of the ease with which a null hypothesis can be rejected.
In fact, with repeated cycles of data gathering and testing, an
experimenter could be certain of rejecting the null hypothesis even if
it were true (Hays, 1963).
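The effect of repeated cycles of data gathering and testing can be seen in a small simulation. This is a sketch under stated assumptions, none of which come from the paper: observations are drawn from a normal population with a true mean of zero and a known standard deviation of 1, a two-tailed z-test at the .05 level is applied after every batch of 10 observations, and the experimenter stops at the first rejection or after 50 looks at the data.

```python
import math
import random


def rejects_with_peeking(rng, looks, batch, z_crit=1.96):
    """Return True if any interim z-test rejects a null hypothesis that is true."""
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(batch):
            total += rng.gauss(0.0, 1.0)  # null is true: mean 0, known sd 1
            n += 1
        z = total / math.sqrt(n)  # z statistic for the running sample mean
        if abs(z) > z_crit:
            return True           # the experimenter stops and reports rejection
    return False


def rejection_rate(looks, experiments=2000, batch=10, seed=0):
    """Proportion of simulated experiments that reject the true null hypothesis."""
    rng = random.Random(seed)
    hits = sum(rejects_with_peeking(rng, looks, batch) for _ in range(experiments))
    return hits / experiments


one_look = rejection_rate(looks=1)     # sample size fixed in advance
many_looks = rejection_rate(looks=50)  # test after every batch

print(one_look)    # close to the nominal .05
print(many_looks)  # several times larger than .05
```

With a single test the rejection rate stays near the nominal level, while testing after every batch inflates it considerably, and the inflation continues to grow as more looks are allowed.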

However, as Edwards et al. (1963) noted, this is not a problem for
a Bayesian who does not use the null hypothesis testing procedure.

     In contrast, if you set out to collect data until your
     posterior probability for a hypothesis which unknown to you
     is true has been reduced to .01, then 99 times out of 100 you
     will never make it, no matter how many data you, or your children
     after you, may collect [p. 239].

The cornerstone of this difference between classical and Bayesian
data collection procedures is known as the "likelihood principle."
This principle flows directly from Bayes' Theorem and the concept of
subjective probability. It is in operation when two different
experimental outcomes (x and y) have the same bearing on opinion about a
parameter. That is, if P(x/λ) and P(y/λ) are proportional functions of λ,
then each of the two data x and y has exactly the same thing to say
about values of λ.



In the discrete case, if P(D'/H_i) = kP(D/H_i) for some positive
constant k, then the likelihood principle operates. For example, in a
coin toss, 10 heads out of 20 throws means the same thing whether the
experimenter planned to make 20 throws or planned to throw until the
tenth head appeared. This simple principle was discussed by classical
statisticians such as Fisher (1956). However, in classical testing the
principle is lost, according to Savage (1962):






     The likelihood principle is in conflict with many
     historically important concepts of statistics. For example,
     whether a test is unbiased depends not on the likelihood
     alone, but rather on Pr(x/λ) considered as a function of x
     as well as a function of λ. Similarly with the concepts
     of significance or confidence level. For instance, it has
     been widely believed that the import of such a datum as 6
     red-eyed flies out of 100 depends on whether the experiment
     was designed to observe 100 flies or designed to observe 6
     red-eyed flies. An estimate unbiased for either of these
     experiments is biased for the other [p. 1...].
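The fly example can be made concrete. The sketch below compares the likelihood of 6 red-eyed flies out of 100 under the two designs Savage mentions: a binomial likelihood when 100 flies were to be observed, and a negative binomial likelihood when observation was to stop at the sixth red-eyed fly. The uniform prior over a grid of values for λ is an assumption made only for illustration.

```python
import math


def binomial_likelihood(lam, red=6, flies=100):
    """Design 1: observe exactly `flies` flies and count the red-eyed ones."""
    return math.comb(flies, red) * lam**red * (1.0 - lam) ** (flies - red)


def negative_binomial_likelihood(lam, red=6, flies=100):
    """Design 2: observe until the `red`-th red-eyed fly, which is fly number `flies`."""
    return math.comb(flies - 1, red - 1) * lam**red * (1.0 - lam) ** (flies - red)


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


grid = [i / 100 for i in range(1, 100)]  # candidate values of lambda

# The two likelihoods differ only by a constant factor k that is free of lambda.
ratios = [binomial_likelihood(lam) / negative_binomial_likelihood(lam) for lam in grid]

# With the same prior (uniform here), the two designs give the same posterior.
posterior_fixed_n = normalize([binomial_likelihood(lam) for lam in grid])
posterior_fixed_red = normalize([negative_binomial_likelihood(lam) for lam in grid])

print(round(ratios[0], 4), round(ratios[-1], 4))  # the same constant at both ends
```

Because the constant cancels when the posterior is standardized, the experimenter's stopping rule leaves the Bayesian conclusion unchanged, even though the classical sampling distributions of the two designs differ.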


Since stopping rules are irrelevant to a Bayesian, greater
objectivity actually results than with the traditional model. Once data
are collected, the original intentions of the experimenter are irrelevant.
The experimenter can collect data until he has proven his point or
exhausted all his funds, time, or patience.

Such freedom should be appealing to an educational evaluator who is
frequently under pressures that interfere with a predetermined plan of data
collection. Frequently a shortage of funds, uncooperative teachers, or
pressures of time prevent data collection from being completed. Left with
incomplete data, most evaluators continue to grind out inferential tests
despite gross violations of principles of classical inference. Frequently
these violations go unnoticed or else are rationalized as being necessary
to meet the demands of the real world. Violations of assumptions do not
necessarily reflect negatively on the practicing evaluator, for statistical
models should reflect reality, rather than force reality to reflect the
statistical model. If the stopping rule principle is violated so frequently,
then the Bayesian model, which disregards the stopping rule, may be more
appropriate for evaluation situations.



IV
CONCLUSION

At this time, a textbook describing Bayesian analyses for educational
problems does not exist, although there are several texts slanted toward
other disciplines which should prove useful to educators (e.g., Morgan,
1968; Zellner, 1971).

One should not assume that less mathematical rigor is required in
Bayesian analysis than in classical statistics; in reality, the opposite
is true. Classical statistics, with its emphasis on the normal underlying
distribution, has been documented so well that students with little
mathematical sophistication can perform adequate statistical analyses.
The Bayesian model, however, requires a good understanding of distribution
theory in order to adequately describe the prior probability distribution.
Some simple analyses using discrete distributions may be within the reach
of almost anyone; however, the full richness of the Bayesian approach may
not be appreciated without some mathematical sophistication.

Most evaluators can take some steps to begin to utilize a more
Bayesian approach in their daily work. Even if the Bayesian model is not
completely accepted, the questions raised by the Bayesians should increase
the vigilance of the evaluator to avoid gross violations of the classical
model. Frequently, when a model in a given area is widely used, the
assumptions underlying the model are taken for granted. (One needs only
to look at the 19th century Newtonian physicists or early 20th century
behaviorists to see this problem.)

The following recommendations are made and should be easy to
implement:

1. Don't parade statistical procedures in an attempt to add
respectability to a subjective process. When an evaluator or any scientist
attempts to generalize beyond his data, he is engaging in a subjective
process (Edwards et al., 1963). The mathematical models may be useful, but
they do not automatically objectify any inferential process.

2. Realize the ease with which the null hypothesis is rejected. Just
because one has been able to reject a null hypothesis at the .01 or .05
level does not necessarily mean that something of educational significance
has been found. As noted by Savage (1962), null hypotheses are frequently
rejected inappropriately, and even if appropriately used, little of any
practical significance can be concluded from the rejection of a single null
hypothesis.

3. Report probability levels when possible. Instead of using the
magical .05 or .01 levels of null hypothesis rejection, the actual
probability levels for the alternative hypothesis should be reported. Most
evaluators are familiar with power and power functions, but they are rarely
discussed beyond a first course in statistics.

4. Specify prior opinion. Since most activities in evaluation
involve hypotheses where the evaluator is not neutral, prior opinion and
the reasons for this opinion are legitimate information. Instead of using
this information covertly when drawing conclusions, openly expressing the
initial bias may be more appropriate. If expressed in probability terms,
Bayes' Theorem could be applied to revise such opinion. Conventional
inferential analyses could still be performed if desired.

5. Remember the data stopping rule. If the evaluator is determined
to test a sharp null hypothesis, the size of the sample must be specified
in advance and sequential testing of data must be avoided. Also, the
significance level should be determined prior to data collection.



V
SUMMARY



Numerous arguments have been presented in support of the appropriateness
of the Bayesian model for the educational evaluator. For the most part the
evaluator is engaging in appropriate practices, but the classical statistical
model does not describe how she/he really works. By changing models, the
evaluator will be able to continue doing what she/he already does, yet she/he
will be better able to explain to others how it is done.

This does not mean that the use of the Bayesian model will not require
any changes in practice. Such a model requires greater specificity in
situations where no rules existed, such as expression of bias towards one
program or the other prior to data collection. However, such changes should
result in a new sense of freedom, since the evaluator can admit that numerous
types and sources of data resulted in her/his decisions and at the same time
she/he can stay within the limits of a credible statistical model.






IIEFERENCES 



Bayes, T. Essay tovi^rds^ solving a problem In the doctrine of chances. The 
.7 Philosophica Trai^action , J.963 , 53 , 370-4ia. (Reprinted from 
BicMaetrika ^ /ISSB^ 45^ 293^15> .•■r.::gr- 

Edwards, W. , Lindman, H. ; & Savage, L. J. Bayes^an statistical inference 

for psychol(l>gical researbh"! Psychological Review ^ 1963, 70^(3) , 193-242. 

■:y-t ■ ■ ; . . ■ . ^ ^ . • . - . • 

Fiisher, R. A, Statisticfal methods and scienQ.fic inference , Edinbxir^: .p 
Oliver and Boyd, 1956. ' 

Hays, W. L. Statistics for the social sciences (2nd ed.) . New York: 
Holt, Rinehart and Winston, .1973. 

McGee, V. 'E. Principles of sliatistics; Traditional and Bayesian . New 
, York: Apple ton-Century-Crofts, 1971 

Meyer, D. L. Bayesian statistics. Paper presented at the annual meeting 
of the American Educational Resecurch Association, New York, 1971. 
- (ERIC No. ED 051 261) . 

Morgcoi, B. W. An introduction to Bayesian statistical decision processes . 
Englewood Cliffs, N. J.: Prentice-Hall, 1968. 

Savage, L. J. The foundations of statistical inference . London: Methuen 
and Co., 1962. 

f ■■■ ■ 

Schlaifer, R. Introduction to statistics for business decisions. New 
York: McGraw-Hill, 1961. ^ 

Stufflebeam, D. L. An introduction to the PDK book: Educational evaluation
and decision-making. In B. R. Worthen and J. R. Sanders (Eds.),
Educational evaluation: Theory and practice. Worthington, Ohio:
Charles A. Jones, 1973.

Thompson, G. E. Statistics for decisions: An elementary introduction.
Boston: Little, Brown, & Co., 1972.

Weber, J. D. Historical aspects of the Bayesian controversy. Tucson, Ariz.:
University of Arizona Press, 1973. 

Worthen, B. R., & Sanders, J. R. Educational evaluation: Theory and practice.
Worthington, Ohio: Charles A. Jones, 1973. 

Zellner, A. An introduction to Bayesian inference in econometrics. New York:
John Wiley and Sons, 1971.






National Evaluation Committee

Francis S. Chase, Chairman
Emeritus Professor
University of Chicago

Helen Bain
Past President
National Education Association

Lyle Bourne
Professor
University of Colorado

Sue Buel

Roald F. Campbell
Emeritus Professor
The Ohio State University

George Dickson
Dean, College of Education
University of Toledo

Larry R. Goulet
Professor
University of Illinois

Chester W. Harris
Professor
University of California - Santa Barbara

William G. Katzenmeyer
Professor
Duke University

Barbara Thompson
Superintendent of Public Instruction
State of Wisconsin

Joanna Williams
Professor
Teachers College
Columbia University



University Advisory Committee

John R. Palmer, Chairman
Dean
School of Education

William R. Bush
Deputy Director
R & D Center

David E. Cronon
Dean
College of Letters and Science

Diane H. Eich
Specialist
R & D Center

Evelyn L. Hoekenga
Coordinator
R & D Center

Dale D. Johnson
Associate Professor
Curriculum and Instruction

Herbert J. Klausmeier
Member of the Associated Faculty
R & D Center

James M. Lipham
Member of the Associated Faculty
R & D Center

Wayne R. Otto
Associate Director
R & D Center

Richard A. Rossmiller
Director
R & D Center

Elizabeth J. Simpson
Dean
School of Family Resources
and Consumer Sciences

Len Van Ess
Associate Vice Chancellor
University of Wisconsin - Madison




Associated Faculty

Vernon L. Allen
Professor
Psychology

B. Dean Bowles
Educational Administration

Thomas P. Carpenter
Assistant Professor
Curriculum and Instruction

Marvin J. Fruth
Professor
Educational Administration

John G. Harvey
Curriculum and Instruction

Frank H. Hooper
Professor
Child Development

Herbert J. Klausmeier
V.A.C. Henmon Professor
Educational Psychology

Joseph T. Lawton
Assistant Professor
Child Development

Joel R. Levin
Educational Psychology

L. Joseph Lins
Professor
Institutional Studies

James M. Lipham
Professor
Educational Administration

Donald N. McIsaac
Professor
Educational Administration

Gerald Nadler
Professor
Industrial Engineering

Wayne R. Otto
Professor
Curriculum and Instruction

Robert G. Petzold
Professor
Curriculum and Instruction

Thomas S. Popkewitz
Assistant Professor
Curriculum and Instruction

Thomas A. Romberg
Curriculum and Instruction

Richard A. Rossmiller
Professor
Educational Administration

Dennis W. Spuck
Assistant Professor
Educational Administration

Michael J. Subkoviak
Assistant Professor
Educational Psychology

Richard L. Venezky
Computer Sciences

J. Fred Weaver
Curriculum and Instruction

Larry M. Wilder
Child Development



